Wrangling categorical data in R

Authors

  • Amelia McNamara
  • Nicholas J. Horton
Abstract

Data wrangling is a critical foundation of data science, and wrangling of categorical data is an important component of this process. However, categorical data can introduce unique issues in data wrangling, particularly in real-world settings with collaborators and periodically updated dynamic data. This paper discusses common problems arising from categorical variable transformations in R, demonstrates the use of factors, and suggests approaches to address data wrangling challenges. For each problem, we present at least two strategies for management, one in base R and the other from the ‘tidyverse.’ We consider several motivating examples, suggest defensive coding strategies, and outline principles for data wrangling to help ensure data quality and sound analysis.

Similar resources

Data Wrangling for Big Data: Challenges and Opportunities

Data wrangling is the process by which the data required by an application is identified, extracted, cleaned and integrated, to yield a data set that is suitable for exploration and analysis. Although there are widely used Extract, Transform and Load (ETL) techniques and platforms, they often require manual work from technical and domain experts at different stages of the process. When confront...

Towards Automated Relational Data Wrangling

It is well-known in data science that 80% of the work is devoted to preprocessing and only 20% to the actual machine learning or data mining step. This motivates us to explore different ways to (help) automate that preprocessing step. This note focusses on the question whether it is possible to (help) automate the data wrangling process for tabular data in data science.

Data Wrangling: Making data useful again

Data analysis has become an everyday business and advancements of data management routines open up new opportunities. Nevertheless, transforming and assembling newly acquired data into a suitable form remains tedious. It is often stated, that data cleaning is a critical part of the overall process, but also consumes sublime amounts of time and resources. Data Wrangling is not only about transfo...

Introducing Data Science to Undergraduates through Big Data: Answering Questions by Wrangling and Profiling a Yelp Dataset

There is an insatiable demand in industry for data scientists, and graduate programs and certificates are gearing up to meet this demand. However, there is agreement in the industry that 80% of a data scientist’s work consists of the transformation and profiling aspects of wrangling Big Data; work that may not require an advanced degree. In this paper we present hands-on exercises to introduce ...

Journal:
  • PeerJ PrePrints

Volume 5

Publication date: 2017